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Abstract 

When dealing with large scale gene expression studies, observations are commonly 
contaminated by unwanted variation factors such as platforms or batches. Not taking this 
unwanted variation into account when analyzing the data can lead to spurious associations 
and to missing important signals. When the analysis is unsupervised, e.g. when the goal 
is to cluster the samples or to build a corrected version of the dataset — as opposed 
to the study of an observed factor of interest — taking unwanted variation into account 
can become a difficult task. The unwanted variation factors may be correlated with the 
unobserved factor of interest, so that correcting for the former can remove the latter if not 
done carefully. We show how negative control genes and replicate samples can be used to 
estimate unwanted variation in gene expression, and discuss how this information can be 
used to correct the expression data or build estimators for unsupervised problems. The 
proposed methods are then evaluated on three gene expression datasets. They generally 
manage to remove unwanted variation without losing the signal of interest and compare 
favorably to state of the art corrections. 

1 Introduction 

Over the last few years, microarray-based gene expression studies involving a large number of 
samples have been carried out (Cardoso et al., 2007; Cancer Genome Atlas Research Network, 
2008), with the hope to help understand or predict some particular factors of interest like the 
prognosis or the subtypes of a cancer. Such large gene expression studies are often carried 
out over several years, may involve several hospitals or research centers and typically con- 
tain some unwanted variation (UV) factors. These factors can arise from technical elements 
such as batches, different technical platforms or laboratories, or from biological signals which 
are unrelated to the factor of interest of the study such as heterogeneity in ages or differ- 
ent ethnic groups. They can easily lead to spurious associations. When looking for genes 
which are differentially expressed between two subtypes of cancer, the observed differential 
expression of some genes could actually by caused by laboratories if laboratories are partially 
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confounded with subtypes. When doing clustering to identify new subgroups of the disease, 
one may actually identify any of the UV factors if their effect on gene expression is stronger 
than the subgroup effect. If one is interested in predicting prognosis, one may actually end 
up predicting whether the sample was collected at the beginning or at the end of the study 
because better prognosis patients were accepted at the end of the study. In this case, the 
classifier obtained would have little value for predicting the prognosis of new patients outside 
the study. Similar problems arise when doing a meta-analysis, i.e., when trying to combine 
several smaller studies rather than working on a large heterogeneous study : in a dataset 
resulting from the merging of several studies the strongest effect one can observe is generally 
related to the membership of samples to different studies. 

A very important objective is therefore to remove these UV factors without losing the 
factor of interest. The problem can be more or less difficult depending on what is actually 
considered to be observed and what is not. If both the factor of interest and all the UV 
factors (say technical batches or different countries) are known, the problem essentially boils 
down to a linear regression : the expression of each gene is decomposed as an effect of 
the factor of interest plus an effect of the unwanted variation factor. When the variance 
of each gene is assumed to be different within each batch, this leads to so-called location 
and scale adjustments such as used in dChip (Li and Wong, 2003). Johnson et al. (2007) 
shrink the unwanted variation and variance of all genes within each batch using an empirical 
Bayes method. This leads to the widely used ComBat method which generally give good 
performances in this case. Walker et al. (2008) propose a version of ComBat which uses 
replicate samples to estimate the batch effect. When replicate samples are available an 
alternative to centering, known as ratio-based method (Luo et al., 2010) is to remove the 
average of the replicate samples within each batch rather than the average of all samples. 
Assuming that the factor of interest is associated with the batch, this should ensure that 
centering the batches does not remove the signal associated with the factor of interest. Note 
that this procedure is closely related to some of the methods we introduce in this paper. 

Of course there is always a risk that some unknown factors also influence the gene expres- 
sion. Furthermore it is sometimes better when tackling the problem with linear models to 
consider UV factors which are actually known as unknown. Their effect may be strongly non- 
linear because they don't affect all samples the same way or because they interact with other 
factors, in which cases modeling them as known may give poor results. When the UV factors 
are modeled as unknown, the problem becomes more difficult because one has to estimate 
UV factors along with their effects on the genes and that many estimates may explain the 
data equally well while leading to very different conclusions. ICE (Kang et al., 2008) models 
unwanted variation as the combination of an observed fixed effect and an unobserved random 
effect. The covariance of the random effect is taken to be the covariance of the gene expres- 
sion matrix. The risk of this approach is that some of the signal associated with the signal 
of interest may be lost because it is included in the covariance of the gene expression matrix. 
SVA (Leek and Storey, 2007) addresses the problem by first estimating the effect of the fac- 
tor of interest on each gene then doing factor analysis on the residuals, which intuitively can 
give good results as long as the UV factors are not too correlated with the factor of interest. 
Teschendorff et al. (2011) propose a variant of SVA where the factor analysis step is done 
by independent component analysis (Hyvrinen, 2001) instead of SVD. The same model as 
Leek and Storey (2007) is considered in a recent contribution of Gagnon-Bartsch and Speed 
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(2012) coined RUV-2, which proposes a general framework to correct for UV in microarray 
data using negative control genes. These genes are assumed not to be affected by the fac- 
tor of interest and used to estimate the unwanted variation component of the model. They 
apply the method to several datasets in an extensive study and show its very good behavior 
for differential analysis, in particular comparable performances to state of the art methods 
such as ComBat (Johnson et al, 2007) or SVA (Leek and Storey, 2007). Sun et al. (2012) 
recently proposed LEAPS, which estimates the parameters of a similar model in two steps : 
first the effect of unwanted variation is estimated by SVD on the data projected along the 
factor of interest, then the unwanted variation factors responsible for this effect and the effect 
of the factor of interest are estimated jointly using an iterative coordinate descent scheme. A 
sparsity-inducing penalty is added to the effect of the factor of interest in order to make the 
model identifiable. Yang et al. (2012) adopt a related approach : they also use the sparsity- 
inducing penalty, do not have the projection step and relax the rank constraint to a trace 
constraint which makes the problem jointly convex in the unwanted variation and effect of 
the factor of interest. Listgarten et al. (2010) model the unwanted variation as a random 
effect term, like ICE. The covariance of the random effect is estimated by iterating between a 
maximization of the likelihood of the factor of interest (fixed effect) term for a given estimate 
of the covariance and a maximization of the likelihood of the covariance for a given estimate 
of the fixed effect term. This is also shown to yield better results than ICE and SVA. 

Finally when the factor of interest is not observed, the problem is even more difficult. It 
can occur if one is interested in unsupervised analyses such as PCA or clustering. Suppose 
indeed that one wants to use a large study to identify new cancer subtypes. If the study 
contains several technical batches, includes different platforms or different labs or any un- 
known factor, the samples may cluster according to one of these factors hence defeating the 
purpose of using a large set of samples to identify more subtle subtypes. One may also simply 
want to "clean" a large dataset from its UV without knowing in advance which factors of 
interest will be studied. Accordingly in the latter case, any knowledgeable person may want 
to start form the raw data and use the factor of interest once it becomes known to remove 
UV. Alter et al. (2000); Nielsen et al. (2002) use SVD on gene expression to identify the UV 
factors without requiring the factor of interest, Price et al. (2006) do so using axes of prin- 
cipal variance observed on SNP data. These approaches may work well in some cases but 
relies on the prior belief that all UV factors explain more variance than any factor of interest. 
Furthermore it will fail if the UV factors are too correlated with the factor of interest. If 
the factor of interest is not observed but the unwanted variation factor is assumed to be an 
observed batch, an alternative approach is to project the data along the batch factors, equiv- 
alently to center the data by batch. This is conceptually similar to using one of the location 
and scale adjustment methods such as Li and Wong (2003) or Johnson et al. (2007) without 
specifying the factor of interest. Benito et al. (2004); Marron et al. (2007) propose a distance 
weighted discrimination (DWD) method which uses a supervised learning algorithm to finds 
a hyperplane separating two batches and project the data on this hyperplane. As with the 
dChip and ComBat approaches, assuming that the unwanted variation is a linear function of 
the observed batch may fail if other unwanted variation factors affect gene expression or if 
the effect of the batch is a more complicated — possibly non-linear — function, or involves 
interaction with other factors. In addition, like the SVD approach, these projections may 
subsequently lead to poor estimation of the factor of interest if it is correlated with the batch 
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effect : if one of the batches contains most of one subtype and the second batch contains 
most of the other subtype the projection step removes a large part of the subtype signal. 
Finally, Oncomine (Rhodes et al., 2004b, a, 2007) regroups a large number of gene expression 
studies which are processed by median centering and normalizing the standard deviation to 
one for each array. This processing does not explicitly take into account a known unwanted 
variation factor or try to estimate it. It removes scaling effects, e.g. if one dataset or part 
of a dataset has larger values than others, but it does not correct for multivariate behaviors 
such as the linear combination of some genes is larger for some batch. On the other hand, it 
does not run the risk to remove biological signal of this form. 

The contribution of Gagnon-Bartsch and Speed (2012) suggests that negative controls 
can be used to estimate and remove efficiently sources of unwanted variation. Our objective 
here is to propose ways to improve estimation in the unsupervised case, i.e, when the factor of 
interest is not observed. We use a random effect to model unwanted variation. As we discuss 
in the paper, this choice is crucial to obtain good estimators when the factor of interest is 
not observed. In this context, two main difficulties arise : 

• In the regression case, and assuming the covariance of the random term is known, 
introducing such a random term simply turns the ordinary least square problem into a 
generalized least square one for which a closed form solution is also available. For some 
unsupervised estimation problems — such as clustering — dealing with the random 
effects can be more difficult : the dedicated algorithm — such as fc-means — may not 
apply or adapt easily to the adding of a random effect term. Our first contribution is 
to discuss general ways of adapting dedicated algorithms to the presence of a random 
effect term. 

• As always when using random effects, a major difficulty is to estimate the covariance 
of the random term. Following Gagnon-Bartsch and Speed (2012), we can use nega- 
tive control genes to estimate this covariance, but this amounts to assuming that the 
expression of these genes is really not influenced by the factor of interest. A second con- 
tribution of this paper is to propose new ways to estimate the unwanted variation factors 
using replicate arrays. Replicate arrays correspond to the same biological sample, but 
typically differ for some unwanted variation factors. For example in a dataset involving 
two platforms, some samples could be hybridized on the two platforms. More generally 
in a large study, some control samples or mixtures of RNAs may be re-hybridized at 
several times of the study, under different conditions. We still don't know what the 
factor of interest is, but we know it takes the same value for all replicates of the same 
sample. Changes among these replicates are therefore only caused by unwanted varia- 
tion factors, and we intend to use this fact to help identify and remove this unwanted 
variation. 

A third contribution of this paper is to assess how various correction methods, including 
the ones we introduce, perform in an unsupervised estimation task on three gene expression 
datasets subject to unwanted variation. We show that the proposed methods generally allow 
us to solve the unsupervised estimation problem, with no explicit knowledge of the unwanted 
variation factors. 

Section 2 recalls the model and estimators used in Gagnon-Bartsch and Speed (2012). 
Section 3 presents our methods addressing the case where the factor of interest is unobserved 
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using a random effect term. We first make the assumption that the unwanted variation fac- 
tors, or equivalently the covariances of the random terms are known. Section 4 discusses 
how estimation of this unwanted variation factor can be improved. In Section 5, the per- 
formance of the proposed methods are illustrated on three gene expression datasets : the 
gender data of Vawter et al. (2004) which were already used to evaluate the RUV-2 method of 
Gagnon-Bartsch and Speed (2012), the TCGA glioblastoma data (Cancer Genome Atlas Research Network, 
2008) and MAQC-II data from rat liver and blood (Lobenhofer et al., 2008; Fan et al., 2010). 
We finish with a discussion in Section 6. 

Notation For any matrix A G W n ' n , \\A\\f = [YliLi Sj=i a ?j) 3 denotes the Frobenius 

norm of A. In addition for a positive definite matrix B G R m > m we define = 

We refer to the problem where the factor of interest is observed as the supervised problem, 

and to the problem where it is unobserved as the unsupervised problem. 

2 Existing fixed effect estimators : RUV-2 and naive RUV-2 

The RUV (Removal of Unwanted Variation) model used by Gagnon-Bartsch and Speed (2012) 
was a linear model, with one term representing the effect of the factors of interest on gene 
expression and another term representing the effect of the unwanted variation factors : 

Y = Xp + Wa + e, (1) 

with Y G R mxn , X G R mx P, (3 G R pxn , W G R mxk , a G R kxn and e G M mxn . Y is the 
observed matrix of expression of n genes for m samples, X represents the p factors of interest, 
W the k unwanted variation factors and e some noise, typically £j ~ A/"(0, <r|/ m ), j = 1, . . . , n. 
Both a and /3 are modeled as fixed, i.e., non-random (Robinson, 1991). 

A similar model was also used in Leek and Storey (2007) and Listgarten et al. (2010). 
The latter used a mixed effect model where a was random. All these methods addressed the 
supervised problem, i.e., the case where X is an observed factor of interest and the goal is to 
estimate its effect /3 knowing that some other factors influence gene expression. They differ 
in the way parameters /3 and Wa are estimated. For the sake of clarity, we now recall the 
approach of Gagnon-Bartsch and Speed (2012) from which our unsupervised estimators are 
derived. 

2.1 Supervised RUV-2 

The method of Gagnon-Bartsch and Speed (2012) exploits the fact that some genes are known 
to be negative controls, i.e., not to be affected by the factor of interest. Formally we suppose 
that j3 c = where /3 C is the restriction of /3 to its column in some index c denoting the 
negative control genes. 

The objective in Gagnon-Bartsch and Speed (2012) is to test the hypothesis that (3j = 
for each gene j in order to identify differentially expressed genes. They use an intuitive 
two-step algorithm coined RUV-2 to estimate Wa and /3 : 

1. Use the columns of Y corresponding to control genes 

Y c = Wa c + e c , (2) 
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to estimate W. Assuming iid noise ~ A/"(0, a^I m ), j € c, the (Wa c ) matrix maximiz- 
ing the likelihood of (2) is argmin W / Q , cjrankW / Qc>fc \\Y C — Wa c Wp. By the Eckart- Young 

theorem (Eckart and Young, 1936), this argmin is reached for Wa c = UAkV T , where 
Y c = UAV T is the singular value decomposition (SVD) of Y c and is the diagonal 
matrix with the k largest singular values as its k first entries and on the rest of the 
diagonal. We can for example use W2 = UA^. 

2. Plug W2 back in the model (1) and do a regression to get a, ft. Assuming iid noise 
£j ~ A/"(0, tr|/ m ), j = 1, . . . , n, this estimates (a, /3) by maximizing the likelihood of (1) 
after doing a plug- in of W<i- 

The hypothesis f3j = can be tested using the regression model in step 2. More generally, 
RUV-2 can yield a corrected expression matrix Y = Y — W2&. 

2.2 Naive unsupervised RUV-2 

If we assume that in addition X is not specified, a simple approach discussed in Gagnon-Bartsch and Speed 
(2012) and called naive RUV-2 can be used to remove Wa. This is to find an estimate of W, 
e.g. doing factor analysis on control genes as in RUV-2, and then simply project the data in 
the orthogonal space of W2- More precisely, 

1. As in RUV-2, use the columns of Y corresponding to control genes to estimate W by 
maximizing the likelihood of (2) : W2 = UA^. 

2. Estimate a by doing a full regression of Y against W2 ■ a = (W 2 T W2)~ 1 W 2 T Y . This 
amounts to maximizing the likelihood of (1) after plugging in W2 and taking X/3 = 0. 

3. Once W and a are estimated, H^a can be removed from Y. The relevant unsupervised 
procedure to estimate X/3 (clustering algorithm, PCA...) can be applied to the corrected 
expression matrix Y — W2&. 

This approach is referred to as naive RUV-2 in Gagnon-Bartsch and Speed (2012) and is 
expected to work well as long as X and W are not too correlated. We study two directions 
that should improve estimation when X and W are not expected to be orthogonal. The first 
one in Section 3 is to use a random a version of (1). The second one in Section 4 is to explore 
new estimators of the unwanted variation factors W. We expect our estimators of Xf3 to 
be more sensitive to the quality of the estimated unwanted variation than in the supervised 
case : since X is not observed anymore there is a higher risk of removing the factor of interest 
from Y by including it in W2- 

3 Random a model for unsupervised estimation 

The supervised RUV-2 and naive RUV-2 estimators presented in Section 2 model both the 
variation caused by the factor of interest and the unwanted variation as a fixed effect. Mod- 
elling a as a random quantity leads to smoother corrections which can give better estimates 
of /3 in the supervised case. This is the model chosen by Listgarten et al. (2010), and if we 
use control genes to estimate W as discussed in Section 2 this leads to a random a version 
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of RUV-2 (Gagnon-Bartsch et al., 2012). We find that random a models can be especially 
helpful when X is unobserved. This section discusses random a based estimation of Xj3, how 
it can be done and why it can work better than the fixed a model of naive RUV-2. We start 
by introducing the random a model in the case where X is observed. 

3.1 Supervised random a model 

If one endows the columns of a with a normal distribution aj *~ A/"(0, a^Jfe), j = 1, . . . , n, 
then (1) can be re- written 

Y = X/3 + e, (3) 

where ij ~ JV(0,£), j = l,...,n for £ = a 2 a WW T + <7 £ I m . Xhe maximum likelihood 
estimator of /3 when £ is known is called the generalized least square estimator (Robinson, 
1991; Preedman, 2005) and is given by : 

P = argmin \\Y - Xp\\% = (X T T,~ 1 Xy 1 X T T,~ 1 Y, (4) 

P 

In this sense the random a versions of RUV discussed in Gagnon-Bartsch et al. (2012) 
are generalized least square estimators made feasible by appropriate estimators of S. 

Estimator (4) leads to a smoother correction of the unwanted variations because the 
scaling by S _1 only shrinks directions of Y which correspond to unwanted variation propor- 
tionally to the amount of variance observed in this direction. By contrast in the fixed effect 
version of RUV-2, the regression step against (X, W) fits as much signal as possible on every 
unwanted variation direction regardless how much variance was actually observed along this 
direction in the control genes. 

Using the posterior likelihood of model (1) with prior ay ~ A/"(0, cr^Jfe), j = 1, . . . ,n, we 
obtain a well known (Robinson, 1991) equivalent formulation of estimator (4) : 

P = argminmin l\\Y - Xp - Wa\\ 2 F + %||a||pl . (5) 
p a I < J 

In the supervised case this equivalence only serves to highlight the relationship between (4) 
and the fixed effect estimator since both (4) and (5) can be solved in closed form. In the 
unsupervised case, (4) can become hard to solve and we will use a similar identity to help 
maximize the likelihood. 

3.2 Unsupervised random a model 

We now consider an unsupervised version of (3). X is not observed anymore and becomes a 
parameter : we want to estimate Xp. This makes the estimation problem more difficult for 
several reasons. First, this obviously increases the number of parameters to be estimated. 
Second, this may introduce a product XP in the objective, making it non-convex in (X, p) - 
unless the problem can be formulated in terms of the Xp matrix as a parameter. Third, this 
may lead to the introduction of discrete constraints which make the resulting optimization 
problem even harder. Finally as we discussed in Section 1, existing algorithms which were 
designed to overcome these three difficulties may not adapt easily to non identity covariance 
matrices S, even in the case where this matrix is known. 
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Typical unsupervised estimation problems include clustering, where X is constrained to be 
a membership matrix and /3 is a real matrix whose rows are the cluster centers, PCA where 
Xf3 is only constrained to have a low rank, and sparse dictionary learning (Mairal et al., 
2010) where the £2 norm of the columns of X are constrained to be 1 and the l\ norm of f3 
is penalized. We first consider the estimation of Xfi assuming that E is known. Estimation 
of E is discussed in Section 4. 



3.2.1 Optimization 

A direct approach could be to solve the maximum likelihood equation of (3) for Xf3 : 

min \\Y-X/3\\l, (6) 

where A4 is a set of matrices which depends on the unsupervised problem. Solving (6) is 
difficult in general because of the structure of M which can be discrete, or at least non- 
convex. Dedicated algorithms often exist for the case where E = I m , but do not always 
generalize to non-spherical E. The purpose of this section is to reformulate (6) to make it 
solvable using the dedicated E = I m algorithm. 

We first discuss why (6) is difficult in the specific case of clustering to fix the ideas. In 
this case the problem can be stated as : 

min \\Y-XPWl, (7) 

xeC,/3eM. kxn 



where C denotes the set of cluster membership matrices 

C = { M e {0, l} mxfc , m m = !> i = 1, • • • , rn\ • 
L j= i > 

Problem (7) corresponds to a non-spherical version of the A:-means objective. It is hard 
because C is discrete, and because the objective is not jointly convex in (X, /3). The k- means 
algorithm (Hastie et al., 2001) finds a local minimum in the case where E = I m by iterating 
between fixing j3 to minimize (7) with respect to X and fixing X to minimize with respect 
to /3. However this classical algorithm cannot be simply adapted to solve (7) with a general 
E because when fixing (3 the solution for the rows of X are coupled by E whereas for the 
diagonal E of /c-means they are independent, and simply amount to assigning each sample to 
the cluster whose current center is the closest. A workaround is to estimate X by iteratively 
solving (7) for each row — keeping f3 and the other rows fixed. It is also possible to relax 
C by constraining the elements of X to be in [0, 1] instead of {0, 1}, and solve in X using a 
projected gradient descent. We observed empirically that for a fixed E this approach does 
well at minimizing the objective, but that the solution can be very sensitive to mediocre 
estimates of E. Instead we use a reformulation of (6) which allows us to use the algorithm 
for E = I m . 

The following proposition allows us to remove the E in problem (6) at the expense of 
introducing a particular correction term : 



S 



Proposition 1. Let R £ R mxn , W € M mxfc , > 0, i/ien 

{||E-^a||| + z/||a|||} = (8) 



mm 

C X n 



w/iere S(W, u) = v" 1 (WW T + ul m ) . 

This identity can be thought of as recapitulating normal Bayesian regression : the 
left hand side is the negative posterior log likelihood of Rj\aij *~ J\f(Waj, I) with prior 

ctj l ~ ^(O,^" 1 /), summed over all columns j. Using a Bayesian argument, this can also be 
written as the negative posterior log likelihood of the aj\Rj. The aj\Rj term disappears by 
minimization over ay and the remaining prior Rj term is the right hand side of (8), with 
covariance S(W, v) being the sum of the covariance of the conditional mean Cov[E[i2j|o;j]] = 
CovfWay] = u~ 1 WW T and the mean of the conditional variance E[Cov[i?j = I. The 
identity can also be verified by simply carrying the optimization over a on the left hand side, 
as detailed in Appendix A. 

We now detail how this identity can be used to solve (6) given a positive definite covariance 

— - T T 

matrix S. Defining W = U (A — ul) 2 U , where £ = UAU is the spectral decomposition 
of X and for any v smaller than the smallest eigenvalue of S, we obtain S(W, v) = S. Setting 
R = Y — Xj3, the right hand side of (8) becomes the objective of (6), which we need to 
solve for A/3 in order to solve our unsupervised estimation problem — but can be difficult as 
we illustrated with the clustering example (7). Proposition (8) suggests that we can instead 
solve : 

min min \\\Y -Wa-X8\\l + v a r , (y) 

XfSeM Q eM fcxn 

which generalizes (5) to the case where X is not observed. In the supervised case, closed 
form solutions were available for both (5) and (4). Now that X is unobserved Proposition 1 
implies that the formulations (6) and (9) are still minimized by the same set of A/3, but no 
closed form is available for either of them in general. However the A/3 part of (9) involves 

minx/?eM — A/3||^ on a corrected matrix Y = Y — Wa. The (9) formulation may therefore 
be minimized numerically provided that a dedicated algorithm, such as fc-means for clustering, 
is available to solve mmxp^M \\Y ~ A/3|||n, and regardless how difficult it is to minimize 
numerically \\Y — X/3\& for general £ under the constraints Ai. 

Note that more generally, defining W = U (A — JJ T for £ < v , we obtain S(W, v) = 

E + (u — i.e., a ridged version of S. In practice as discussed in Section 4, we use various 
estimators of S. In all cases, this estimator is based on a limited amount of data which can 
give it a high variance and make it poorly conditioned. For these two reasons, ridging is a 
good thing and in practice we use £ = and some v > which we choose using a heuristic 
we propose in Section 5. 

A joint solution for (A/3, a) is generally not available for (9). A first naive solution is to set 
A/3 = 0, maximize over a, and then apply the relevant unsupervised estimation procedure, 
e.g., /c-means to Y — Wa. More generally a possible way of maximizing the likelihood of 
A/3 is to alternate between a step of optimization over a for a given A/3, which corresponds 
to a ridge regression problem, and a step of optimization over A/3 for a given a using the 
relevant unsupervised estimation procedure for S = I m . Each step decreases the objective 
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\\Y — X/3 — Woi\\f + i^||a|||., and even if this procedure does not converge in general to the 
maximum likelihood of X/3, it may yield better estimates than the naive version. 

Finally note that Proposition 1 only holds for v > 0. For v = the objective on the left 
hand side of (8) is the joint negative log-likelihood of the fixed effect model (1). Fixing X/3 = 
and optimizing over a in this case yields the naive RUV-2 algorithm of Gagnon-Bartsch and Speed 
(2012). However it still makes sense to optimize the joint likelihood of (X/3, a) by a similar 
iterative scheme. In this case the step where we optimize over a becomes an ordinary linear 
regression. 

3.2.2 Benefit of random a for unsupervised estimation 

If W is estimated as in RUV-2, the naive method of setting X/3 = 0, maximizing over a 
and applying the relevant unsupervised estimation procedure yields a random-a version of 
naive RUV-2, i.e., one in which the regression step is replaced by a ridge regression. The 
benefit of this procedure compared to naive RUV-2 may not be obvious, as the only technical 
difference is to replace an ordinary regression by a ridge regression. But since a is estimated 
with X/3 = 0, the difference can be important if X and W are correlated. Figure 1 shows 
an example of such a case. The left panel represents genes for m = 2. Red dots are control 
genes, gray ones are regular genes. The largest unwanted variation W\ has some correlation 
with the factor of interest X. In naive RUV-2 with k = 1, the correction projects the samples 
in the orthogonal space of W\ , which can remove a lot of the signal coming from the factor of 
interest. This is illustrated on the center panel which shows the data corrected by naive RUV- 
2, i.e., by projecting the genes on the orthogonal space of W\. The projection removes all 
effect coming from W\ but also reduces a lot the association of genes with X. Note that this 
is true regardless of the amount of variance actually caused by W\ ■ the result would be the 
same with an almost spherical unwanted variation || Wi || — || W2II because once W is identified, 
the projection step of naive RUV-2 does not take into account any variance information. On 
the other hand, the projection does not account at all for the unwanted variation along Wi- 
By contrast, the random a correction shown on the right panel of Figure 1 takes the variance 
into account. The ridge regression only removes a limited amount of signal along each UV 
direction, proportional to the amount of variance that was observed in the control genes. In 
the extreme case where W\ is equal to X, the projection of RUV-2 removes all association 
of the genes with X. Provided that the amount of variance along W\ is correctly estimated, 
the random a model only removes the variation caused by W\, leaving the actual association 
of the genes with X. 

To summarize, given an estimate of W , we consider either a fixed or a random a model to 
estimate X/3. For each of these we have either the naive option of estimating a for X/3 = or 
to minimize the joint likelihood of (X/3, a) by iterating between the estimation of X/3 and a. 
The naive fixed a procedure corresponds to the naive RUV-2 of Gagnon-Bartsch and Speed 
(2012). We expect the other alternatives to lead to better estimates of X/3. 

3.3 Correlation on samples vs correlation on genes 

In model (3) we arbitrarily chose to endow a with a distribution, which is equivalent to 
introducing an m x m covariance matrix X on the rows of the Y — X/3 residuals. If we 
choose instead to model the rows of W as iid normal vectors with spherical covariance, we 
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Figure 1: Naive RUV-2 (fixed a) and random a based corrections, with m = 2. 

introduce a n x n covariance matrix S' on the columns of the Y — X/3 residuals. In the 
supervised case and if we consider a random j3 as well, the maximum a posteriori estimator 
of /3 incorporates the prior encoded in £' by shrinking the (5j of positively correlated genes 
towards the same value and enforcing a separation between the f3j of negatively correlated 
genes. This approach was used in Desai and Storey (2012). As an example if a € R lxn is 
a constant vector, i.e., if £' is a constant matrix with an additional positive value on its 
diagonal, the maximum a posteriori estimator of /3 boils down to the "multi-task learning" 
estimator (Evgeniou et al., 2005; Jacob et al., 2009) detailed in Appendix B. This model as 
is does not deal with any unwanted variation factor which may affect the samples. Using two 
noise terms in the regression model, one with correlation on the samples and the other one 
on the genes would lead to the same multi-task penalty shrinking some f3j together, but with 
a ||.|||, loss instead of the regular Frobenius loss discussed in Appendix B. 

Note that this discussion assumes £' is known and used to encode some prior on the 
residuals of the columns of Y — X/3. If however £' needs to be estimated, and estimation 
is done using the empirical covariance of Y — X/3, the estimators of f3 derived from (3) and 
from the model with an n x n covariance T,' on the Y — X/3 residuals become very similar, 
the only difference being that in one case the estimator of a\W, Y — X/3 is shrinked and in 
the other case the estimator of W\a, Y — Xj3 is shrinked. 

4 Estimation of W 

When X is observed it is usual to introduce the classical mixed effect or generalized least 
square regression (Robinson, 1991; Freedman, 2005) for a fixed covariance £ first and to 
discuss the practical estimation in a second step. Similarly for each of the options discussed 
in Section 3, we rely on a particular estimate of W. A common approach coined feasible 
generalized least squares (FGLS) in the regression case (Freedman, 2005) is to start with 
an ordinary least squares step to estimate (3 and then use some estimator of the covariance 
on Y — X/3. Listgarten et al. (2010) used a similar approach to estimate their covariance 
matrix. An alternative is to use the restricted maximum likelihood (REML) estimator which 
corrects for the fact that the same data are used to estimate the mean X/3 and the covariance, 
whereas the empirical covariance matrix maximizes the covariance assuming X/3 is known. 
Listgarten et al. (2010) mention that they tried both ML and REML but that this did not 
make a difference in practice. 
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In this section, we discuss estimation procedures for W which are relevant in the case 
where X is not observed. As a general comment, recall that fixed a models such as naive 
RUV-2 require rank(IU) < rank(Y) otherwise the entire Y is removed by the regression step, 
but random a models allow rank(lU) > rank(Y) because they lead to penalized regressions. 

So far, we considered the estimate Wi used in supervised RUV-2, which relies on the SVD 
of the expression matrix restricted to its control genes. This estimate was shown to lead to 
good performance in supervised estimation (Gagnon-Bartsch and Speed, 2012). Unsuper- 
vised estimation of X(3 may be more sensitive to the influence of the factor of interest X on 
the control genes : in the case of fixed a models, if the estimated W is very correlated with 
X in the sense of the canonical correlation analysis (Hotelling, 1936), i.e., if there exists a 
linear combination of the columns of X which has high correlation with a linear combination 
of the columns of W, then most of the association of the genes with X will be lost by the 
correction. Random a models are expected to be less sensitive to the correlation of W with 
X but could be more sensitive to poor estimates of the variance carried by each direction of 
unwanted variation. 

This suggests that unsupervised estimation methods could benefit from better estimates 
of W. We present two directions that could lead to such estimators. 

4.1 Using replicates 

We first consider "control samples" for which the factor of interest X is 0. In practice, one 
way of obtaining such control samples is to use replicate samples, i.e., samples that come 
from the same tissue but which were hybridized in two different settings, say across time or 
platform. The profile formed by the difference of two such replicates should therefore only 
be influenced by unwanted variation - those whose levels differ between the two replicates. 
In particular, the X of this difference should be 0. More generally when there are more than 
two replicates, one may take all pairwise differences or the differences between each replicate 
and the average of the other replicates. We will denote by d the indices of these artificial 
control samples formed by differences of replicates, and we therefore have X d = where X d 
are the rows of X indexed by d. 

4.1.1 Unsupervised estimation of Wa 

A first intuition is that such samples could be used to identify a the same way Gagnon-Bartsch and Speed 
(2012) used control genes to estimate W. Therefore, we start by presenting how control sam- 
ples can be used to estimate Wa in an unsupervised fashion, and then discuss how the 
procedure yields a new estimator of W. We consider the following algorithm : 

• Use the rows of Y corresponding to control samples 

Y d = W d a + e d , (10) 

to estimate a. Assuming iid noise £j ~ M(0,a^I m ), j = 1, . . . ,n, the (W d a) matrix 
maximizing the likelihood of (10) is argmin^/dQ ran ki4 /d a>fc \\Y d ~ W d a\\'p. By the same 
argument used in Section 2 for the first step of RUV-2, this argmin is reached for 
W d a = PEkQ T , where Y d = PEQ T is the singular value decomposition (SVD) of Y d 
and Ek is the diagonal matrix with the k largest singular values as its k first entries 
and on the rest of the diagonal. We can use a = E^Q^ . 
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• Plugging a in (2), the maximum likelihood of W is now solved by a linear regression, 
W = Y c a T c {a c a~l)~ 1 . 

• Once W and a are estimated, Wa can be removed from Y. 

X is not required in this procedure which in itself constitutes an unsupervised correction 
for Y. 

4.1.2 Comparison of the two estimators of W 

This procedure also yields an estimator of W, which can be plugged in any of the procedures 
we discussed in Section 3. The estimator Wi we considered so far was obtained using the 
first left singular vectors of the control genes Y c , which can also be thought of as a regression 
of the control genes on their first right singular vectors, i.e., the main variations EkQ T 
observed in the control genes. By contrast the estimator that we discuss here is obtained by 
a regression of the control genes against the main variations observed in the control genes 
for the control samples formed by differences of replicates. Assuming our control genes 
happen to be influenced by the factor of interest X, i.e., /3 C 7^ 0, the estimator of W solely 
based on control genes may have more association with X than it should, whereas the one 
using differences of replicate samples should not be affected. On the other hand, restricting 
ourselves to the variation observed in differences of replicates may be too restrictive because 
we don't capture unwanted variation when no replicates are available. 

To make things more precise, let us assume that the control genes are actually influenced 
by the factor of interest X and that /3 C ~ A/"(0,crl ). In this case we have E[Y C Y C T ] = 

XX T a\ + WW T a 2 a + I m al, so if we use Y c to estimate W or £ as we do for W% the estimate 
will be biased towards X. Let us now consider the estimator W r obtained by the replicate 
based procedure. To simplify the analysis we assume that k = d and therefore a = Y d in the 
procedure described in Section 4.1.1. Consequently Wa = Y c (Y d ) T (Y d (Y d ) T ^j 1 Y d . Define 

W r = Y c (Y c d ) T (Y d (Y d ) T )~* . Assuming X d is indeed equal to we can develop : 

W r = (Xp c +Wa c +e c ){W d a c +e d c ) T (w d a c aj (W d ) T + W d a c (e d c ) T + e d c aj (W d ) T + ef(^) T )~ . 

We now make some heuristic asymptotic approximations in order to get a sense of the behav- 
ior of W r - a c aj and e d (ef) T are Wishart variables which by the central limit theorem are 
close to cl m a\ and cl m o\ respectively if the number of control genes is large enough regard- 
less how good the control genes are, i.e., how small ap c is. In addition dot products between 
independent multivariate normal variables are close to in high dimension so we approximate 
/3 c aJ,/3 c eJ and a c sj by 0. The approximations involving /3 C depend in part how good the 
control genes are, but can still be valid for larger crg c if the number of control gene is large 
enough. We further assume that a £ <C (J a an d that the control samples are independent from 
the samples for which we estimate W and ignore the cl m a\ and £ c (£ d ) T terms. Implementing 

all these approximations yields W r ~ a a c^W(W d ) T (W d (W d ) T ) 1 . Writing W d = AAB T 
for the SVD of W d , we obtain W r ~ a a c2WB r Aj , where r is the rank of W d and A r ,B r 
contain the first r columns of A and B respectively. This suggests that if W d has rank k W r 
is a good estimator of W in the sense that it is not biased towards X even if the control genes 
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are influenced by X. If W d is column rank deficient, the B r mapping can delete or collapse 
unwanted variation factors in W r . The effect is easier to observe on the estimator of £ : 
W r Wj ~ a^cW B r Bj W T . Consider for example the following case with 3 unwanted vari- 
ation factors and 3 replicate samples with unwanted variation (1,0,3), (0,1,3) and (1,1,3). 
The W d and corresponding B r Bj obtained by taking differences between replicates (1,2), 
(1, 3) and (3, 2) are 

/ 1 -1 \ / 1 \ 

W d = -1 and B r Bj = 1 , 

\ 1 / \ / 

so the B r Bj factor removes the third factor from the estimate of S. This is because the 
3 replicates have the same value for the third factor. Similarly if two factors are perfectly 
correlated on the replicate samples, e.g., the first two factors for (1, 1,0), (0, 0, 1) and (1, 1, 1), 
the W d and corresponding B r Bj for the same differences between replicates (1,2), (1,3) and 
(3, 2) are 

/ 1 1 -1 \ / 1/2 1/2 \ 

^= 0-1 and B r Bj = 1/2 1/2 , 

\ i i o / \ o o i / 

which collapses the first two factors into an average factor and leaves the third one unchanged. 

Finally, another option in the context of random a models is to combine the control 
gene based and replicate based estimators of W by concatenating them. In terms of S, this 
amounts to summing the two estimators of the covariance matrix. This may help if, as in 
our first example, some factors are missing from W r because all pairs of replicates have the 
same value for these factors. In this case, combining it with W2 could lead two an estimate 
containing less X but still containing all the unwanted variation factors. 

4.2 Using residuals 

We already mentioned that in the case where X is observed, a common way of estimating 
W known as FGLS is to first do an ordinary regression of Y against X, then compute the 
empirical covariance on the residuals Y — X/3. The estimators of W that we discussed work 
around the estimation of X/3 by using genes for which /3 is known to be or samples for 
which X is known to be 0. Once we start estimating Xf3, e.g., by iterating over X/3 and a 
as described at the end of Section 3 we can use a form of FGLS and re-estimate W using 
Y — X/3. If the current estimator of X/3 is correct, this amounts to making all the genes 
control genes, and all the samples control samples. 

4.3 Using a known W 

Finally in some cases we may want to consider that W is observed. For example, if the dataset 
contains known technical batches, involves different platforms or labs, W could encode these 
factors instead of being estimated from the data. In particular if the corresponding W is a 
partition of the samples, then naively estimating a by regression using X/3 = and removing 
Wa from Y corresponds to mean-centering the groups defined by W. In most cases however, 
this procedure or its shrunken equivalent doesn't yield good estimates of X/3. This was also 
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observed by Gagnon-Bartsch and Speed (2012) in the supervised case. One reason is that this 
W only accounts for known unwanted variation when other unobserved factors can influence 
the gene expression. The other one is that this approach leads to a linear correction for 
the unwanted variation in the representation used in W. If we know that gene expression is 
affected by the temperature of the scanner, setting a column of W to be this temperature 
leads to a linear correction whereas the effect of the temperature may be quadratic, or involve 
interactions with other factors. In this case, estimating W implicitly allows us to do a non- 
linear correction because the estimated W could fit any non-linear representation of the 
observed unwanted variation which actually affects gene expression. 

5 Result 

We now evaluate our unsupervised unwanted variation correction methods on several mi- 
croarray gene expression datasets. We focus on clustering, a common unsupervised analysis 
performed on gene expression data (Speed, 2003) typically to identify subtypes of a dis- 
ease (Perou et al., 2000). Note that the proposed method is in no way restricted to cluster- 
ing, which is simply used here to assess how well each correction did at removing unwanted 
variation without damaging the signal of interest. 

Because of the unsupervised nature of the task, it is often difficult to determine whether 
the partition obtained by one method is better than the partition obtained by another 
method. We adopt the classical strategy of turning supervised problems into unsupervised 
ones : for each of the three datasets we consider in this section, a particular grouping of the 
samples is known and considered to be the factor of interest which must be recovered. Of 
course, the correction methods are not allowed to use this known grouping structure. 

In order to quantify how close each clustering gets to the objective partition, we adopt 
the following squared distance (Bach and Harchaoui, 2007) between two given partitions 
C = (ci, and C = (c' 1; c' k ) of the samples into k clusters : 



This score ranges between when the two partitionings are equivalent, and k — 1 when 
the two partitions are completely different. To give a visual impression of the effect of the 
corrections on the data, we also plot the data in the space spanned by the first two principal 
components, as PCA is known to be a convex relaxation of /c-means (Zha et al., 2001). For 
each of the correction methods that we evaluate, we apply the correction method to the 
expression matrix Y and then estimate the clustering using a fe-means algorithm. 

In addition to the methods we introduced in Sections 3 and 4 of this paper, we consider 
as baselines : 

• An absence of correction, 

• A centering of the data by level of the known unwanted variation factors. 

• The naive RUV-2 procedure of Gagnon-Bartsch and Speed (2012). 
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Centering of the data by level of the known unwanted variation factors means that if 
several unwanted variation factors are expected to affect gene expression, we center each 
group of samples having the same value for all of the unwanted variation factors. Of course 
this implies that unwanted variation factors are known and this may not correct the effect of 
unobserved factors. 

On each dataset, we evaluate three basic correction methods : the unsupervised replicate- 
based procedure described in Section 4.1.1, the random a model with W = W2 and the 
random a model with W combining W2 and W r . For each of these three methods, we also 
evaluate an iterative version which alternates between estimating a using a ridge regression 
and estimating X(3 using the sparse dictionary learning technique of (Mairal et al., 2010). 
The sparse dictionary estimator finds (X, (3) minimizing the objective : 



under the constraint that X/3 has rank p and the columns of X have norm 1. The problem 
is not jointly convex in (X, /3) so there is no guarantee of obtaining a global minimizer. For 
the iterative methods, we also re-estimate W using the Y — X(3 residuals as discussed in 
Section 4.2 every 10 iterations. Many other methods are possible such as iterative fixed a 
(for different estimators of W), random a using the W r estimator introduced in Section 4.1.1 
with or without iterations, with or without updating W, or replacing the sparse dictionary 
estimator by other matrix decomposition matrix techniques such as convex relaxations of 
fc-means. We restrict ourselves to just the set of methods mentioned for the sake of clarity 
and because we think they illustrate the benefit of the different options that we discuss in 
this paper. 

Some of the methods require the user to choose some hyper-parameters : the ranks k of 
Wa and p of X(3, the ridge v and the strength A of the l\ penalty on f3. In order to decrease 
the risk of overfitting and to give good rules of thumb for application of these methods to 
new data, we tried to find relevant heuristics to fix these hyperparameters and to use them 
consistently on the three benchmarks. Some methods may have given better performances 
on some of these datasets for different choices of the hyperparameters. We discuss empirical 
robustness of the methods to their hyperparameters for each dataset. The rank k of Wa 
was chosen to be close to m/4, or to the number of replicate samples when the latter was 
smaller than the former. For methods using p, we chose p = k. For random a models, we use 
k = m : the model is regularized by the ridge v. We do not combine it with a regularization 
of the rank. Since v is a ridge parameter acting on the eigenvalues of W T W, we chose 
it to be cji(VF t T / F) x 10~ 3 , where <J\{A) is the largest eigenvalue of a positive semidefinite 
matrix A. Special care must be taken for the iterative methods : their naive counterparts 
use Xj3 = and therefore minimize \\Y — Wa\\ F , whereas the iterative estimators minimize 
\\Y — Xp — Wa\\ 2 F which can lead to corrections of smaller norm ||Wq;||f- In order to make 
them comparable, we choose A such that ||Wa||i? is close to the one obtained with the non- 
iterative algorithm. 

All experiments were done using R. For the experiments using sparse dictionary algo- 
rithms, we used the SPAMS software 1 developed by the first author of Mairal et al. (2010) 
under GPL license and for which an R interface is available. 

1 http://spams-devel. gforge.inria.fr/ 




16 



5.1 Gender data 



We first consider a 2004 dataset of patients with neuropsychiatric disorders. This dataset 
was published by Vawter et al. (2004) in the context of a gender study : the objective was 
to study differences in gene expression between male and female patients affected by these 
neuropsychiatric disorders. It was already used in Gagnon-Bartsch and Speed (2012) to study 
the performances of RUV-2 : the number of genes from the X and Y chromosomes which were 
found significantly expressed between male and female patients was used to assess how much 
each correction method helped. This gender study is an interesting benchmark for methods 
aiming at removing unwanted variation as it expected to be affected by several technical and 
biological factors : two microarray platforms, three different labs, three tissue localizations in 
the brain. Most of the 10 patients involved in the study had samples taken from the anterior 
cingulate cortex (a), the dorsolateral prefontal cortex (d) and the cerebellar hemisphere (c). 
Most of these samples were sent to three independent labs : UC Irvine (I), UC Davis (D) 
and University of Michigan, Ann Arbor (M). Gene expression was measured using either 
HGU-95A or HGU-95Av2 Affymetrix arrays with 12, 600 shared between the two platforms. 
Six of the 10 x 3 x 3 combinations were missing, leading to 84 samples. We use as control 
genes the same 799 housekeeping probesets which were used in Gagnon-Bartsch and Speed 
(2012). We use all possible differences between any two samples which are identical up to 
the brain region or the lab, leading to 106 differences. Note that using all these replicates 
leads to non-iid data in the replicate based methods : each sample is dependent on all the 
differences in which it is involved. The discussion in Section 4.1.2 suggests that this could 
bias the estimation of W. However we seem to obtain reasonable results for this dataset. As 
a pre-processing, we also center the samples per array type. 

Since most genes are not affected by gender, and because fc-means is known to be sensitive 
to noise in high dimension, clustering by gender gives better results in general when removing 
genes with low variance before applying /c-means. For each of the methods assessed, we 
therefore try clustering after filtering different numbers of genes based on their variance on 
the corrected. Figure 2 shows the clustering error (11) for the methods against the number 
of genes retained. The uncorrected and mean-centering cases are not displayed to avoid 
cluttering the plot, but give values above 0.95 for all numbers of genes retained. Figure 3 
shows the samples in the space of the first two principal components in these two cases, 
keeping the 1260 genes with highest variance. On the uncorrected data (left panel), it is clear 
that the samples first cluster by lab which is the main source of variance, then by brain region 
which is the second main source of variance. This explains why the clustering on uncorrected 
data is far away from a clustering by gender. Mean-centering samples by region-lab (right 
panel) removes all clustering per brain region or lab, but doesn't make the samples cluster by 
gender. The gray line of Figure 2 shows the performance of naive RUV-2 for k = 20. Since 
naive RUV-2 is a radical correction which removes all variance along some directions, it is 
expected to be more sensitive to the choice of k. The estimation is damaged by using k = 40 
(clustering error 0.99). Using k = 5 also degrades the performances, except when very few 
genes are kept. 

The purple lines of Figure 2 represent the replicate-based corrections. The solid line shows 
the performances of the non-iterative method described in Section 4.1.1. Its performances 
are similar to the ones of naive RUV-2 in this case, except when very few genes are selected 
and the replicate-based method leads to a perfect clustering by gender. As for the naive 



17 



00 

o 

o 

o <° 
c 



O 



CM 

o 

o 
o 




200 600 1000 

Filtered in genes 

Figure 2: Clustering error against number of genes selected (based on variance) before clus- 
tering. Gray line : naive RUV-2, purple lines : replicate-based corrections, green lines : 
random a model using W2, red lines : random a model using a combination of W2 and W r . 
Full lines : no iteration, dotted lines : with iterations. 
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Figure 3: Samples of the gender study represented in the space of their first two principal 
components before correction (left panel) and after centering by lab plus brain region (right 
panel). Blue samples are males, pink samples are females. The labels indicate the laboratory 
and brain region of each sample. The capital letter is the laboratory, the lowercase one is the 
brain region. 
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Figure 4: Using replicates. Left: no iteration, right: with iterations 



RUV-2 correction, using a k too large damages the performances. However, the replicate- 
based correction is more robust to overestimation of k : using k = 40 only leads to a 0.83 
error. This is also true for the other benchmarks, and can be explained by the fact that the 
correction is restricted to the variations observed among the contrasts of replicate samples 
which are less likely to contain signal of interest. 

The iterative version in dotted line leads to much better clustering except again when 
very few genes are selected. Figure 4 shows the samples in the space of the first two principal 
components after applying the non-iterative (left panel) and iterative (right panel) replicate- 
based method. The correction shrinks the replicates together, leading to a new variance 
structure, more driven by gender although not separating perfectly males and females. One 
may wonder whether this shrinking gives a better partition structure than clustering each 
region-lab separately, e.g. clustering only the samples taken from the cerebellar hemisphere 
and sent to UC Irvine. This actually leads to clustering errors higher than 0.89 for all 
region-lab pairs, even when keeping only few genes based on their variance. 

The green lines of Figure 2 correspond to the random a based corrections using the 
estimate of W computed on control genes only. The solid line shows the results for the non- 
iterative method. Note that in this case, the v = a\(W T W) x 10~ 3 is not optimal, as larger 
v lead to much lower clustering errors. This correction yields good results on this dataset, as 
illustrated by the reasonably good separation obtained in the space of the first two principal 
components after correction on the left panel of Figure 5. Note that the variance ||VFa||^ 
removed by the random a correction is larger that the one removed by naive RUV-2 and the 
replicate-based procedure. However the advantage of ridge RUV-2 does not come from its 
doing more deflation, it is rather a handicap here. As we discussed, random a with a larger 
penalty v, and therefore a smaller ||Wa||ir, leads to better results. If we increase k for the 
naive and replicate based procedures to get similar ||Wa|| as the random a correction, we 
obtain clustering scores of 0.99 and 0.83 respectively. 

In order for the random a correction to work, it is crucial to have a good estimate of £ 
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Figure 5: Random alpha with control genes only. Left: no iteration, right: with iterations 



or equivalently a good estimate W with the information of how much variance is caused by 
unwanted variation along each direction. As a consequence, it is crucial in the unsupervised 
case, even more than in the supervised case, to have good control genes. This is probably why 
the random a based correction works well on gender data : few genes are actually influenced 
by gender so in particular the housekeeping genes contain little signal related to the factor 
of interest. The dotted green line of Figure 2 corresponds to the random a based corrections 
with iterations plus sparsity, which in this case leads to lower clustering errors. Here again, 
sparsity works well because lots of genes are not affected by gender. As a sanity check, we 
tried adding random genes to the control genes used in non-iterative methods but this did 
not lead to a clear improvement. 

Finally, the red lines of Figure 2 correspond to the random a estimators using a combi- 
nation Wi and W r . The full line corresponds to the non- iterative method, and shows that 
combining the two estimators of W leads to a better correction than the two other methods 
which each rely on only one of these estimators. Recall however that the purple line uses W r 
in a fixed a model, and therefore illustrates the improvement brought by using replicates to 
estimate W, compared to the gray line which only uses negative control genes. Using W r 
with a random a model (not shown) leads to better performances than W^- Combining the 
estimators further improves the performances, although not by much. As for the previous 
methods, adding iterations to better estimate a improves the quality of the correction. 

We also repeated the same experiment without centering the samples by array type. The 
performances were essentially the same as in the non-centered case for all methods, except 
that replicate based corrections do not work anymore and give clustering error 1 for all 
numbers of filtered genes. This is expected since we don't have replicates for array types and 
the replicate-based method has no way to correct for something which does not vary across 
the replicates. Using control genes however allows us to identify and remove the array type 
effect. 
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Figure 6: Random alpha with control genes only. Left: no iteration, right: with iterations 

5.2 TCGA glioblastoma data 

5.2.1 Data 

We now illustrate the performances of our method on the gene expression array data generated 
in the TCGA project for glioblastoma (GBM) tumors (Cancer Genome Atlas Research Network, 
2008). These tumors were studied in detail in Verhaak et al. (2010). For each of the 460 sam- 
ples, gene expression was measured on three different platforms : Affymetrix HT-HG-U133A 
Genechips at the Broad Institute, Affymetrix Human Exon 1.0 ST Genechips at Lawrence 
Berkeley Laboratory and Agilent 244K arrays at University of North Carolina. Verhaak et al. 
(2010) selected 200 tumors and 2 normal samples from the dataset based on sample quality 
criterions and filtered 1740 genes based on their coherence among the three platforms and 
their variability within each platform. The expression values from the three platforms were 
then merged using factor analysis. They identified four GBM subtypes by clustering analysis 
on this restricted dataset : Classical, Mesenchymal, Proneural and Neural. We study these 
202 samples across the three platforms, keeping all the 11861 genes in common across the 
three platforms. Among these 202 samples, 173 were identified by Verhaak et al. (2010) as 
"core" samples : they were good representers of each subtypes. 38 of them are Classical, 56 
Mesenchymal, 53 Proneural and 26 Neural. 

5.2.2 Design 

For the purpose of the experiment, we study how well a particular correction allows us to 
recover the correct label of the 147 Classical, Mesenchymal and Proneural tumors, leaving 
the other ones aside. Our objective is to recover the correct subtypes using a fc-means with 
3 clusters. We consider two settings. In the first one, we use a full design with all 147 
samples from 3 platforms. In the second one we build a confounding setting in which we only 
keep the Classical samples on Affymetrix HT-HG-U133A arrays, the Mesenchymal samples 
on Affymetrix Human Exon arrays and the Proneural samples on Agilent 244K arrays. In 
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each case, we use 5 randomly selected samples that we keep for all 3 platforms and use as 
replicates. We do not use other samples as replicates even in the full design when all samples 
could potentially be used as replicates. Among the 5 selected samples one was Neural, two 
Proneural, and two were not assigned a subtype. The results presented are qualitatively 
robust to the choice of these replicates. 

In the confounded design, a correction which simply removes the platform effect is likely 
to also lose all the subtype signal because it is completely confounded with the platform, 
up to the replicate samples. The reason why we only keep 3 subtypes is to allow such a 
total confounding of the subtypes with the 3 platforms. However in this design, applying no 
correction at all is likely to yield a good clustering by subtype because we expect the platform 
signal to be very strong. A good correction method should therefore perform well in both 
the confounded and the full design. In the full design, the platform effect is orthogonal to the 
subtype effect so we expect the correction to be easier. Of course in this case, the uncorrected 
data is expected to cluster by platform which this time is very different from the clustering 
by subtype since each sample is present on each platform. 

5.2.3 Result 



Method 


Full 


Confounding 


No correction 


2 





Mean-centering 


0.3 


1.93 


Ratio method 


0.31 


0.79 


Naive RUV-2 


2 





Random a 


0.21 


1.5 


+ iterations 


0.15 


1 


Replicate based 


0.2 


0.61 


+ iterations 


0.17 


0.16 


Combined 


1.3 


1.2 


+ iterations 


1.3 


1 



Table 1: Clustering error of TCGA GBM data with full and confounded designs for various 
correction methods. Since there are 3 clusters, errors range between and 2. 

Table 1 shows the clustering error obtained for each correction method on the two de- 
signs. Recall that since there are 3 clusters, clustering errors range between and 2. Most 
of the plots of the corrected data in the space spanned by the first two principal components 
are deferred to appendix C. As expected, the uncorrected data give a maximal error on the 
full design and in the presence of confounding. This is because, as seen on Figure 7, the 
uncorrected data cluster by platform which in the full design are orthogonal to the subtypes 
and in the second design are confounded with subtypes. For similar reasons, centering the 
data by platform works well in the full design but fails when there is confounding because 
removing the platform effect removes most of the subtype effect. When replicates are avail- 
able, a variant of mean centering is to remove the average of replicate samples from each 
platform. This is known as the ratio method (Luo et al., 2010) and does improve on regular 
mean-centering in the presence of confounding. A disadvantage of this method is that it 
amounts to considering that W is a partition of the data by batch (in this case by platform) 
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Figure 7: Uncorrected full design GBM data in the space of their first four principal compo- 
nents. Left panel : PC 1 and 2, right panel : PC 3 and 4. Colors denote subtypes, shapes 
denote platforms. 

whereas as discussed in Section 4.3 the actual unwanted variation may be a non linear func- 
tion of the batch, possibly involving other factors. Note that we do not explicitly assess the 
ratio method for the other benchmarks because all samples are used as replicates so the ratio 
methods becomes equivalent to mean centering. 

Naive RUV-2 gives a maximal error in the full design, otherwise. This result is actually 
caused by the fact that naive RUV-2 is extremely sensitive to the choice of k. Since the total 
number of differences formed on the 15 replicates is 10, we use k = 2 as a default for naive 
RUV-2 and the replicate based procedure. In the full design, the platform effect is contained 
in the first three principal components of the control gene matrix Y c and the fourth principal 
component contains the subtype effect. This can clearly be seen on Figure 7. Removing 
only one or two directions of variance leaves too much platform effect. Removing a third one 
(k = 3) gives a small error of 0.15 and removing a fourth one gives an error of 1.83. When the 
platform is confounded with the subtype, removing one or two components leads to a perfect 
clustering by subtype because the third principal component still contains platform/subtype 
signal. Removing more does not allow us to recover a clustering by subtype. So if we used 
k = 3 instead of k = 2, the result would be inverted : good for the full design, bad in 
the presence of confounding. The random a model works well in the full design, less so in 
the presence of confounding, as illustrated on Figure 14 and 17 respectively. While it was 
reasonably robust to the choice of the ridge parameter v on the gender data, it is more 
sensitive on this one. Using u = a\{W~^ W) x 5.10 -2 instead of 10~ 3 does not remove enough 
platform effect and leads to an error of 2 on the full design, in the presence of confounding. 
Using a smaller factor of 5.10 -4 leads to an error of 0.27 (1.65 with confounding), and 10 -4 
to an error of 1.94 (1.98 with confounding). Because the correction made by the random a 
model is softer than the one of the fixed a naive RUV-2, using u = a\iW~^W) x 10~ 3 allows 
us to recover subtype signal in both designs. The sensitivity to v is likely to be caused by 
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the large difference of magnitude between a±(W T W) and the next eigen values : the first 
one represents 98% of the total variance. This is to be expected in most cases in presence 
of a strong technical batch effect. In both designs, using iterating between estimation of X(3 
using sparse dictionary learning and estimation of a using ridge regression further improves 
the performances. The replicate-based correction gives good results for both designs, as 
illustrated on Figure 15 and 18. Like for the gender data, it seems to be robust to the choice 
of k. For k = 10 it gives errors 0.2 and 0.53 in the first and second design respectively. 
Here again, adding iterations on (X(3, a) improves the quality of the correction in each case. 
Finally the combined method gives a clustering error of 1.3 and 1.2 in the first and second 
design respectively. This is not a very good performance but the method still manages to 
retain some of the factor of interest and does so consistently in the presence and in the 
absence of confounding. The corresponding corrected data are shown in Figure 16 and 19 
for the full and confounded design respectively. Iterations improve the performance in the 
confounding design and do not change the result in the full design. Figure 16 suggests that 
for the full design, the problem comes from the fact that the correction does not remove all 
of the platform effect. 

Note that, as with the gender data, the difference observed between the correction meth- 
ods cannot be only explained by the fact that some of them remove more variance than others. 
For example in the full design, naive RUV-2, the replicate based procedure and the random 
a correction lead to similar ||IUa||_F but to very different performances : naive RUV-2 fails 
to remove enough platform signal whereas the other corrections remove enough of it for the 
arrays to cluster by subtype. In the confounding design, both naive RUV-2 and the replicate 
based procedure lead to similar ||Wa||.F and similar performances. The random a correction 
leads to a larger || Wck||_p which explains its poor behavior. Estimating what amount of vari- 
ance should be removed is part of the problem, so it would not be correct to conclude that 
the random a correction works as well as the others in this case. It is however interesting to 
check whether the problem of a particular method is its removing too much or not enough 
variance or whether it is a qualitative problem. 

5.3 MAQC-II data 

We finally assess our correction methods on a gene expression dataset which was generated 
in the context of the MAQC-II project (Shi et al., 2010). The study was done on rats and 
the objective was to assess hepatotoxicity of 8 drugs. For each drug, three time points were 
done for three different doses. For each of these 8x3x3 combinations, 4 animals were tested 
for a total of 288 animals. For each animal, one blood and one liver sample were taken. Gene 
expression in blood and in the liver were measured using Agilent arrays and gene expression 
in the liver was also measured using Affymetrix arrays. The Agilent arrays were loaded using 
the marray R package. Each array was loess normalized, dye swaps were averaged and each 
gene was then assigned the median log ratio of all probesets corresponding to the gene. The 
Affymetrix arrays were normalized using the gcrma R package. Each gene was then assigned 
the median log ratio of all probesets corresponding to the gene. For this experiment we retain 
samples from all platforms and tissues for the highest dose of each drug and for the last two 
time points 24 hours and 48 hours. Most of these drugs are not supposed to be effective for 
the earlier time points. This leads to a set of 186 arrays that we restrict to the 9502 genes 
which are common to all platforms. Each sample has a replicate for each tissue and platform, 
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but there is no replicate against the time effect. For control genes, we used the same list of 
housekeeping genes as for the other datasets but converted to their rat orthologs, leading to 
210 control genes. 

The interest of this complex design is obvious for the purpose of this paper : the result- 
ing dataset contains a large number of arrays measuring gene expression influenced by the 
administered drug which we consider to be our factor of interest and by numerous unwanted 
variation factors. Array type, tissue, time and dose are likely to influence gene expression, 
preventing the arrays from clustering by drug. This clustering problem is much harder than 
the gender and glioblastoma ones. First of all, the drug signal may not be as strong as the 
gender which at least for a few genes is expected to be very clear or as the glioblastoma sub- 
types which were defined on the same dataset. Second and maybe more important, it is an 
8-class clustering problem, which is intrinsically harder than 2- or 3-class clusterings. Finally 
as we discuss in Section 5.4, the control genes for this dataset do not behave as expected. 



Method 


Error 


No correction 


5.9 


Mean-centering 


5.1 


Naive RUV-2 


6.6 


Random a 


4.7 


+ iteration 


5.4 


Replicate based 


2.8 - 3.8 


+ iterations 


2.8 - 3.8 


Combined 


5.7 


+ iterations 


5.7 



Table 2: Clustering error of MAQC-II data for various correction methods. 

The errors obtained after applying each correction are displayed in Table 2. Recall that 
for this dataset, we are trying to recover a partition in 8 classes corresponding to the 8 drugs 
of the study so the maximum clustering error is 7. The left panel of Figure 8 represents the 
uncorrected samples in the space of the first two principal components. The first principal 
components is clearly driven by the presence of two different types of arrays. The clustering 
error in this case is 5.9. Centering by platform-tissue, i.e. centering separately the Affymetrix 
arrays, the Agilent liver and the Agilent blood, the data points do not cluster by platform 
anymore but just like for the gender data this does not lead to a clear clustering by drug. 
This can be seen on the right panel of Figure 8. The resulting clustering error is 5.1. The 
naive RUV-2 correction doesn't lead to any improvement compared to the uncorrected data, 
leading to an error of 6.6. The random a estimator hardly improves the performances, and 
its iterative variant even increases a little the error. Figure 9 shows that these methods 
lead to a better organization of the samples by drug, but still far from a clean clustering. 
The replicate-based method leads to better performances. Even though we do 200 runs of 
/c-means to minimize the within sum of square objective, different occurrences of the 200 
runs lead to different clusterings with close objectives. We choose to indicate the range of 
clustering errors given by these different clusterings (2.8-3.8). The iterative version of the 
estimator gives the same range of errors. Figure 10 shows that these corrections indeed 
lead to a better organization of the samples by drugs in the space spanned by the first 
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Figure 8: Samples of the MAQC-II study represented in the space of their first two principal 
components before correction (left panel) and after centering by tissue and platform region 
(right panel). Each color represents a different drug. The labels indicate the platform of each 
sample. 



two principal components, but fails to correct the time effect against which no replicate is 
available. Finally the combined method gives poor performances. Figure 11 shows that it 
does not correct enough and that the data still cluster by platform. Increasing the magnitude 
of the correction by using a smaller ridge v removes the platform effect but still doesn't lead 
to a good clustering by drug. As it can be seen on Figure 11, the iterative version of the 
estimator leads to a very similar and equally poor estimate. The deflation || Wen obtained 
by the naive RUV-2 and replicate based procedures are larger than the one obtained by the 
random a correction, but this is not the reason for the replicate based procedure to work 
better than the random a correction : the former is quite robust to changes in k and the 
latter does not improve when changing v. 

5.4 Benefit of control genes 

We have assumed so far that control genes had little association with X and allowed proper 
estimation for the methods we introduced. In this section, we assess this hypothesis on our 
three benchmarks. Table 3 reproduces the results on the first two benchmarks of the non- 
iterative methods that we considered and which make use of control genes. In addition for 
each method and each of our first two benchmarks, we show the performance of the same 
method using all genes as control genes. For the gender data, we give the clustering error 
when filtering in 1260 genes, which correspond to the last point of Figure 2. 

The results of MAQC-II data are not presented in Table 3 but the result of each method 
is the same whether we use our control genes or all the genes for this dataset. Overall, we can 
see that some methods are affected by the use of control genes on the gender data, but using 
all the genes only mildly affects the performances of most methods on the GBM dataset, and 
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Figure 9: Samples of the MAQC-II study represented in the space of their first two principal 
components after applying the random a correction (left panel) and its iterative variant (right 
panel). Each color represents a different drug. The labels indicate the time of each sample. 
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Figure 10: Samples of the MAQC-II study represented in the space of their first two principal 
components after applying the replicate-based correction (left panel) and its iterative variant 
(right panel). Each color represents a different drug. The labels indicate the time of each 
sample. 
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Figure 11: Samples of the MAQC-II study represented in the space of their first two principal 
components after applying the combined correction (left panel) and its iterative variant (right 
panel). Each color represents a different drug. The labels indicate the time of each sample. 



Method 


Gender 


Gender 


GBM 1 


GBM 1 


GBM 2 


GBM 2 




control 


all genes 


control 


all genes 


control 


all genes 


Naive RUV-2 


0.75 


0.92 


2 


1.52 





0.93 


Replicate-based 


0.77 


0.77 


0.2 


0.25 


0.61 


0.37 


Random a 


0.67 


0.99 


0.21 


0.24 


1.5 


1.8 


Combined 


0.45 


0.54 


1.3 


1.3 


1.2 


1.8 



Table 3: Clustering error of gender and glioblastoma data with full (1) or confounded (2) 
designs for various correction methods relying on control genes using either all genes or control 
genes. 



as we said do not affect the performances on the MAQC-II dataset at all. This suggests that 
the genes that we used as control genes were indeed less affected by the factor of interest for 
the gender data but were not for the glioblastoma and MAQC-II data. This is consistent with 
the fact that methods which rely heavily on the control genes like naive RUV-2 and random 
a are very sensitive to the amplitude of the correction for the glioblastoma dataset and do 
not work for the MAQC-II dataset. As one may expect from the discussion of Section 4.1.2, 
methods using replicates, i.e., the replicate-based one introduced in Section 4.1.1 and the 
combined one seem less affected than methods that solely rely on control genes, even on the 
gender dataset. Remember that our replicate-based procedure estimates W by regressing 
the control genes Y c against the variations observed among constrasts of replicates which can 
make it robust to the fact that control genes are affected by the factor of interest. 

In order to verify the fact that the control genes used for the gender data are good control 
genes whereas the ones used for the other datasets are not good control genes, we show the 
CCA of all control genes and all non-control genes with the factor of interest X as a boxplot 
for each dataset on Figure 12. Interestingly, the control genes used in the gender data are 
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Figure 12: boxplot of the CCA of control and non-control genes with the factor of interest X 
for the gender (left panel), glioblastoma (center panel) and MAQC-II (right panel) datasets. 




Figure 13: First canonical correlation of the factor of interest X with the space spanned by 
the k first eigenvectors of the empirical covariance computed on control genes (in red) and 
non-control genes (in black) against k for the gender (left panel), glioblastoma (center panel) 
and MAQC-II (right panel) datasets. 



typically more associated with X than the non-control genes whereas the opposite is observed 
for the glioblastoma and MAQC-II datasets. This seems to contradict the fact that control 
genes help identifying W in the gender data and does not in the two others. Since W is 
essentially estimated using PCA on Y c which is a multivariate procedure, we represent the 
first canonical correlation of X with the eigen space corresponding to the k first eigenvectors 
as a function of k on Figure 13. It is clear from the figure that for the gender dataset the 
eigen space built using control genes has a smaller association with X than the one built 
using non-control genes whereas this is much less clear for the two other datasets. 

To conclude, the case of gender data suggests that when good control genes are available 
they do help estimate and remove unwanted variation, especially for estimators which do not 
use replicate samples. The notion of good control samples seems to have more to do with the 
fact that the directions of maximum variance among these genes are not associated with X 
than with individual univariate association of the genes with X. When control genes are as 
associated as the other genes with X, methods using replicate samples still give reasonable 
estimates and other methods become either ineffective or very sensitive to the amplitude of 
the correction. 
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6 Discussion 



We proposed methods to estimate and remove unobserved unwanted variation from gene 
expression data when the factor of interest is also unobserved. In particular, this allows us 
to recover biological signal by clustering on datasets which are affected by various types of 
batch effects. The methods we introduced generalized the ideas of Gagnon-Bartsch and Speed 
(2012) and used two types of controls : negative control genes, which are assumed not to 
be affected by the factor of interest and can therefore be used to estimate the structure of 
unwanted variation, and replicate samples, whose differences are by construction not affected 
by the factor of interest. Differences of replicate samples can therefore also be used to estimate 
unwanted variation. 

On three gene expression benchmarks, we verified that when good negative control genes 
were available, the correction techniques that used these negative controls were able to suc- 
cessfully estimate and remove unwanted variation, leading to the expected clustering of the 
samples according to the factor of interest. When the negative control genes are affected 
by the factor of interest, these techniques become less efficient, but replicate-based methods 
are successful at removing unwanted variation with respect to which replicates are available. 
Correcting for unwanted variation with respect to which replicates are given may sound less 
useful, because it implies that the unwanted variation is an observed factor such as a batch or 
a platform. However, even in this case, our techniques which estimate the unwanted variation 
factor outperform methods which use the known factor. This can be explained by the fact 
that the actual factor affecting the samples may be a non-linear function of the observed 
factor. In addition, replicates with respect to an observed unwanted variation factor may 
embed unknown unwanted variation. This suggests that when both types of controls are 
available, replicate based correction is a safer option, as it removes some unwanted variation 
and in the worst case scenario leaves some other unwanted variation in the data. Negative 
control gene based methods can remove more unwanted variation but run the risk to remove 
some signal of interest if the control genes were actually affected by the factor of interest. 
While the quality of negative control genes did affect the relative ranking of the two families 
of techniques, both gave reasonable performances on the first two benchmarks, indicating 
that they could be helpful on real data for this difficult estimation problem. 

For both families of methods, we also proposed iterative versions of the techniques, which 
alternate between solving an unsupervised estimation problem for a fixed estimate of the 
unwanted variation, and estimating unwanted variation using the current estimate of the 
signal of interest. This approach often improved the latter estimate and never damaged the 
performances. 

When the objective is to do clustering, the correction may benefit from more appropriate 
surrogates than the sparse dictionary learning problem. Better techniques to choose the 
hyperparameter v for the random a based methods could also improve the performances. 
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A Proof of Proposition 1 

The left hand side of (8) is a standard ridge regression and has a closed form solution : 
a* = argmin{||i2-VPa|||. + i/||a|||} = (w T W + ul k ) ~* W T R, 

so for any R G 



sfcxn 



min {\\R- Wa\\p + u\\a\\%} 



I m - W (W 1 W + vl k \ W ) R 



fw T W + uI k \ X W T R 



tr R T ( I m - 2W ( W T W + ul k 1 W T + W (W T W + uh) W T W [W T W + vl 



+ uW {w T W + vIkj~ W T jR, 
where we used the fact that \\A\\ 2 F = tr A T A. This is tr R T (l m + WBW T ) R with : 
B = (w T W + ul k ) ~ W T W (w T W + ul k ) ~ + v (w T W + ul k ) " 2 - 2 f W T W + i/J* 



VF T Ty + ^/ fc J [w T W (W T W + vl k ) + v[W T W + vl k 

w T w + z//^ ~ x ( (w T w + iz/fc) (w T w + i/ifc) -1 - 24 
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mm 
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B Estimator of (3 for a particular £' defined on the columns 
of the residuals 



We consider the following model : 



Y = X/3 + e, 



where Y, e G M mxri , X G M mx P, /? G W xn . We further assume that each row of e is distributed 
as A/"(0, E') where E' G M nxn is a covariance matrix. Note that this is different from (3) where 
the m x m covariance was defined on the rows of e. 
(12) is equivalent to 

Y = Xfi + Wa + e, (13) 



3G 



where each column Wj of W is such that Wj *~ J\f(0, a^Im), eij *~ J\f(0, a 2 ) and a € R kxn 
for some k < m is such that a T a + a 2 I m = 

Assume k = 1 and a = 1 T , where 1 is the all-one vector in W 1 . Then £' = o 2 l n + 11 T , 
i.e. a constant matrix plus some additional constant on the diagonal. In this special case, 
W is a single standard normal column vector and (13) can be written : 

Y = X/3 + Wa + e 
= Xp + W1 T + e 

= x(/3 + (X T X)- 1 X T W1 T ) + e + R, 



where R is the projection of W1 T to the orthogonal space of X. We can disregard it be- 
cause a noise orthogonal to X has no effect on a regression against X. Denoting V = 
(X T X)~ 1 X T W1 T , we see that (12) with this particular covariance is equivalent to assuming 
Y = Xb + e, where b = (3 + V. Each column of V is equal to v = (X T X)~ 1 X T W , i.e., to the 
projection of W on X. If /3 is assumed to be non-random and we estimate it by maximum 
likelihood we recover the regular OLS : (3 + V is not identifiable. If we add a normal prior 
on /3, the MAP equation is : 

maxL(p,v\X,Y) oc maxL(A, Y\p, v)p(/3)p(v) (14) 

(3,v 0,v 

= m<ix{\ogL(X,Y\P,v) +logp(/3) + logp(v)} (15) 

P,v 

= min \\Y - X((3 + V)f F + X\\(3\\ 2 F + u\\v\\ 2 F , (16) 

where A, v depend on the prior variances of /3 and a. Then plugging b = (3 + V and denoting 
its columns by bi, 
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2 , f 

F+ A 








i - m 
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l&lllj 




where b = argmiiiy (5ZIt=i ll^i — ^'lll' + tII^IIj^) i s a shrinked average of the bi. The first 
equality is replacing f3 + V by b and /? by b — V. The second equality is moving the min^ to 
the part which depends on v. The last equality carries out the minimization over v. 

Adding a block structure to or equivalently rows to a which are 1 for some genes and 
for others, leads to an additional regularizer which penalizes the sum of squares among the 
/3i within each block. 



C PCA plots for the GBM data 
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Figure 14: GBM full design random a with and without iterations. Colors represent subtypes, 
shapes represent platforms. 
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Figure 15: GBM full design replicate-based with and without iterations. Colors represent 
subtypes, shapes represent platforms. 



38 




C\J 

O 

Q_ 



O 

o 



o 
I 



d 
I 



• A A A 



A 



o 

I 



CO 

d - 
i 



A A A A 4k A a 



▲ 



-0.2 -0.1 0.0 0.1 0.2 0.3 -0.15 
PC1 



-0.05 0.05 0.10 0.15 0.20 

PC1 



Figure 17: GBM confounded design random a with and without iterations. Colors represent 
subtypes, shapes represent platforms. 
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Figure 18: GBM confounded design replicate-based with and without iterations. Colors 
represent subtypes, shapes represent platforms. 
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Figure 19: GBM confounded design combined with and without iterations. Colors represent 
subtypes, shapes represent platforms. 
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